During the pandemic years, credit card usage in the United States changed considerably. Average credit card debt grew by 52% between 2018 and 2019, but this figure fell significantly in 2020 (ValuePenguin; Household Debt and Credit Report, New York Fed).
The United States carries $807 billion of credit card debt spread across 506 million credit cards, and the average American household owes $6,270. These figures had been rising every year until 2020, when COVID-19 produced a drop: average household debt fell by almost $1,000, to $5,315. The pandemic changed everyday life in ways that explain this decline. People spent less on purchases and other expenses, and many received supplementary income from unemployment benefits or other programs that they could put toward these debts (Resendiz, 2021).
To better understand the possible causes of this drop in debt, we also looked into methods for resolving these financial situations. The first is debt settlement companies. These companies offer to negotiate with credit card issuers so that the debtor can reduce the amount owed, but the debtor should keep in mind that the negotiation can take time and that the settlement company charges a fee. They should also beware of fraudulent companies that will simply leave them worse off (Federal Trade Commission).
Another option is for the delinquent borrower to negotiate directly with their creditors. They may be granted a lower interest rate, which makes the debt easier to repay: with a lower rate, less of each monthly payment is consumed by accrued interest charges, so the debt can be paid off faster. Maximizing cash flow is also crucial; whether through new or better employment or by cutting expenses, every amount helps. Finally, the borrower should organize and prioritize their debts: reviewing and ordering everything they owe helps them develop a plan for how, and in what order, to pay it off (Debt).
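The effect of a lower rate can be illustrated with a rough payoff calculation. This is a sketch with hypothetical rates and a hypothetical $200 monthly payment, not figures from the cited sources:

```python
def months_to_pay_off(balance, annual_rate, monthly_payment):
    """Months needed to clear `balance`, with interest accruing at
    annual_rate / 12 each month (assumes the payment exceeds the interest)."""
    months = 0
    while balance > 0:
        balance += balance * annual_rate / 12  # interest accrues first
        balance -= monthly_payment
        months += 1
    return months

# The 2020 average household debt of $5,315 at $200/month clears
# noticeably faster at a negotiated lower APR
print(months_to_pay_off(5315, 0.24, 200))  # at a hypothetical 24% APR
print(months_to_pay_off(5315, 0.12, 200))  # at a hypothetical 12% APR
```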
These figures and solutions clearly depend heavily on the individuals who hold the debt, so it is important to know which characteristics describe a client who defaults on their credit card payments. The goal of this project is to build machine learning models that predict which kinds of clients will, or will not, default. For this we use a dataset containing default payments, demographic factors, credit data, payment history, and bill statements of credit card clients in Taiwan from April 2005 to September 2005.
General objectives:
Specific objectives:
Description of the dataset as received:
There are 25 variables (22 quantitative, 3 categorical):
import pandas as pd
import numpy as np
from pandas_profiling import ProfileReport
import seaborn as sns
import matplotlib.pyplot as plt
from quickda.clean_data import *
from quickda.explore_data import *
df = pd.read_csv("UCI_Credit_Card.csv")
df.head()
| ID | LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | PAY_0 | PAY_2 | PAY_3 | PAY_4 | ... | BILL_AMT4 | BILL_AMT5 | BILL_AMT6 | PAY_AMT1 | PAY_AMT2 | PAY_AMT3 | PAY_AMT4 | PAY_AMT5 | PAY_AMT6 | default.payment.next.month | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 20000.0 | 2 | 2 | 1 | 24 | 2 | 2 | -1 | -1 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 689.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 |
| 1 | 2 | 120000.0 | 2 | 2 | 2 | 26 | -1 | 2 | 0 | 0 | ... | 3272.0 | 3455.0 | 3261.0 | 0.0 | 1000.0 | 1000.0 | 1000.0 | 0.0 | 2000.0 | 1 |
| 2 | 3 | 90000.0 | 2 | 2 | 2 | 34 | 0 | 0 | 0 | 0 | ... | 14331.0 | 14948.0 | 15549.0 | 1518.0 | 1500.0 | 1000.0 | 1000.0 | 1000.0 | 5000.0 | 0 |
| 3 | 4 | 50000.0 | 2 | 2 | 1 | 37 | 0 | 0 | 0 | 0 | ... | 28314.0 | 28959.0 | 29547.0 | 2000.0 | 2019.0 | 1200.0 | 1100.0 | 1069.0 | 1000.0 | 0 |
| 4 | 5 | 50000.0 | 1 | 2 | 1 | 57 | -1 | 0 | -1 | 0 | ... | 20940.0 | 19146.0 | 19131.0 | 2000.0 | 36681.0 | 10000.0 | 9000.0 | 689.0 | 679.0 | 0 |
5 rows × 25 columns
explore(df, method="summarize")
| dtypes | count | null_sum | null_pct | nunique | min | 25% | 50% | 75% | max | mean | median | std | skew | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ID | int64 | 30000 | 0 | 0.0 | 30000 | 1.0 | 7500.75 | 15000.5 | 22500.25 | 30000.0 | 15000.500 | 15000.5 | 8660.398 | 0.000 |
| LIMIT_BAL | float64 | 30000 | 0 | 0.0 | 81 | 10000.0 | 50000.00 | 140000.0 | 240000.00 | 1000000.0 | 167484.323 | 140000.0 | 129747.662 | 0.993 |
| SEX | int64 | 30000 | 0 | 0.0 | 2 | 1.0 | 1.00 | 2.0 | 2.00 | 2.0 | 1.604 | 2.0 | 0.489 | -0.424 |
| EDUCATION | int64 | 30000 | 0 | 0.0 | 7 | 0.0 | 1.00 | 2.0 | 2.00 | 6.0 | 1.853 | 2.0 | 0.790 | 0.971 |
| MARRIAGE | int64 | 30000 | 0 | 0.0 | 4 | 0.0 | 1.00 | 2.0 | 2.00 | 3.0 | 1.552 | 2.0 | 0.522 | -0.019 |
| AGE | int64 | 30000 | 0 | 0.0 | 56 | 21.0 | 28.00 | 34.0 | 41.00 | 79.0 | 35.486 | 34.0 | 9.218 | 0.732 |
| PAY_0 | int64 | 30000 | 0 | 0.0 | 11 | -2.0 | -1.00 | 0.0 | 0.00 | 8.0 | -0.017 | 0.0 | 1.124 | 0.732 |
| PAY_2 | int64 | 30000 | 0 | 0.0 | 11 | -2.0 | -1.00 | 0.0 | 0.00 | 8.0 | -0.134 | 0.0 | 1.197 | 0.791 |
| PAY_3 | int64 | 30000 | 0 | 0.0 | 11 | -2.0 | -1.00 | 0.0 | 0.00 | 8.0 | -0.166 | 0.0 | 1.197 | 0.841 |
| PAY_4 | int64 | 30000 | 0 | 0.0 | 11 | -2.0 | -1.00 | 0.0 | 0.00 | 8.0 | -0.221 | 0.0 | 1.169 | 1.000 |
| PAY_5 | int64 | 30000 | 0 | 0.0 | 10 | -2.0 | -1.00 | 0.0 | 0.00 | 8.0 | -0.266 | 0.0 | 1.133 | 1.008 |
| PAY_6 | int64 | 30000 | 0 | 0.0 | 10 | -2.0 | -1.00 | 0.0 | 0.00 | 8.0 | -0.291 | 0.0 | 1.150 | 0.948 |
| BILL_AMT1 | float64 | 30000 | 0 | 0.0 | 22723 | -165580.0 | 3558.75 | 22381.5 | 67091.00 | 964511.0 | 51223.331 | 22381.5 | 73635.861 | 2.664 |
| BILL_AMT2 | float64 | 30000 | 0 | 0.0 | 22346 | -69777.0 | 2984.75 | 21200.0 | 64006.25 | 983931.0 | 49179.075 | 21200.0 | 71173.769 | 2.705 |
| BILL_AMT3 | float64 | 30000 | 0 | 0.0 | 22026 | -157264.0 | 2666.25 | 20088.5 | 60164.75 | 1664089.0 | 47013.155 | 20088.5 | 69349.387 | 3.088 |
| BILL_AMT4 | float64 | 30000 | 0 | 0.0 | 21548 | -170000.0 | 2326.75 | 19052.0 | 54506.00 | 891586.0 | 43262.949 | 19052.0 | 64332.856 | 2.822 |
| BILL_AMT5 | float64 | 30000 | 0 | 0.0 | 21010 | -81334.0 | 1763.00 | 18104.5 | 50190.50 | 927171.0 | 40311.401 | 18104.5 | 60797.156 | 2.876 |
| BILL_AMT6 | float64 | 30000 | 0 | 0.0 | 20604 | -339603.0 | 1256.00 | 17071.0 | 49198.25 | 961664.0 | 38871.760 | 17071.0 | 59554.108 | 2.847 |
| PAY_AMT1 | float64 | 30000 | 0 | 0.0 | 7943 | 0.0 | 1000.00 | 2100.0 | 5006.00 | 873552.0 | 5663.580 | 2100.0 | 16563.280 | 14.668 |
| PAY_AMT2 | float64 | 30000 | 0 | 0.0 | 7899 | 0.0 | 833.00 | 2009.0 | 5000.00 | 1684259.0 | 5921.164 | 2009.0 | 23040.870 | 30.454 |
| PAY_AMT3 | float64 | 30000 | 0 | 0.0 | 7518 | 0.0 | 390.00 | 1800.0 | 4505.00 | 896040.0 | 5225.682 | 1800.0 | 17606.961 | 17.217 |
| PAY_AMT4 | float64 | 30000 | 0 | 0.0 | 6937 | 0.0 | 296.00 | 1500.0 | 4013.25 | 621000.0 | 4826.077 | 1500.0 | 15666.160 | 12.905 |
| PAY_AMT5 | float64 | 30000 | 0 | 0.0 | 6897 | 0.0 | 252.50 | 1500.0 | 4031.50 | 426529.0 | 4799.388 | 1500.0 | 15278.306 | 11.127 |
| PAY_AMT6 | float64 | 30000 | 0 | 0.0 | 6939 | 0.0 | 117.75 | 1500.0 | 4000.00 | 528666.0 | 5215.503 | 1500.0 | 17777.466 | 10.641 |
| default.payment.next.month | int64 | 30000 | 0 | 0.0 | 2 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 | 0.221 | 0.0 | 0.415 | 1.344 |
profile = ProfileReport(df, minimal=True)
profile
df = clean(df, method = "standardize")
df.head()
| id | limit_bal | sex | education | marriage | age | pay_0 | pay_2 | pay_3 | pay_4 | ... | bill_amt4 | bill_amt5 | bill_amt6 | pay_amt1 | pay_amt2 | pay_amt3 | pay_amt4 | pay_amt5 | pay_amt6 | default.payment.next.month | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 20000.0 | 2 | 2 | 1 | 24 | 2 | 2 | -1 | -1 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 689.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 |
| 1 | 2 | 120000.0 | 2 | 2 | 2 | 26 | -1 | 2 | 0 | 0 | ... | 3272.0 | 3455.0 | 3261.0 | 0.0 | 1000.0 | 1000.0 | 1000.0 | 0.0 | 2000.0 | 1 |
| 2 | 3 | 90000.0 | 2 | 2 | 2 | 34 | 0 | 0 | 0 | 0 | ... | 14331.0 | 14948.0 | 15549.0 | 1518.0 | 1500.0 | 1000.0 | 1000.0 | 1000.0 | 5000.0 | 0 |
| 3 | 4 | 50000.0 | 2 | 2 | 1 | 37 | 0 | 0 | 0 | 0 | ... | 28314.0 | 28959.0 | 29547.0 | 2000.0 | 2019.0 | 1200.0 | 1100.0 | 1069.0 | 1000.0 | 0 |
| 4 | 5 | 50000.0 | 1 | 2 | 1 | 57 | -1 | 0 | -1 | 0 | ... | 20940.0 | 19146.0 | 19131.0 | 2000.0 | 36681.0 | 10000.0 | 9000.0 | 689.0 | 679.0 | 0 |
5 rows × 25 columns
to_categoric = ["sex", "education", "marriage",
"pay_0", "pay_2","pay_3", "pay_4", "pay_5","pay_6","default.payment.next.month"]
df = clean(df, method = 'dtypes', columns = to_categoric,
dtype='category')
df = df.rename(columns={'pay_0': 'pay_1'})
explore(df, method="summarize")
| dtypes | count | null_sum | null_pct | nunique | min | 25% | 50% | 75% | max | mean | median | std | skew | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| age | int64 | 30000 | 0 | 0.0 | 56 | 21.0 | 28.0 | 34.0 | 41.0 | 79.0 | 35.4855 | 34.0 | 9.217904 | 0.732246 |
| bill_amt1 | float64 | 30000 | 0 | 0.0 | 22723 | -165580.0 | 3558.75 | 22381.5 | 67091.0 | 964511.0 | 51223.3309 | 22381.5 | 73635.860576 | 2.663861 |
| bill_amt2 | float64 | 30000 | 0 | 0.0 | 22346 | -69777.0 | 2984.75 | 21200.0 | 64006.25 | 983931.0 | 49179.075167 | 21200.0 | 71173.768783 | 2.705221 |
| bill_amt3 | float64 | 30000 | 0 | 0.0 | 22026 | -157264.0 | 2666.25 | 20088.5 | 60164.75 | 1664089.0 | 47013.1548 | 20088.5 | 69349.387427 | 3.08783 |
| bill_amt4 | float64 | 30000 | 0 | 0.0 | 21548 | -170000.0 | 2326.75 | 19052.0 | 54506.0 | 891586.0 | 43262.948967 | 19052.0 | 64332.856134 | 2.821965 |
| bill_amt5 | float64 | 30000 | 0 | 0.0 | 21010 | -81334.0 | 1763.0 | 18104.5 | 50190.5 | 927171.0 | 40311.400967 | 18104.5 | 60797.15577 | 2.87638 |
| bill_amt6 | float64 | 30000 | 0 | 0.0 | 20604 | -339603.0 | 1256.0 | 17071.0 | 49198.25 | 961664.0 | 38871.7604 | 17071.0 | 59554.107537 | 2.846645 |
| default.payment.next.month | category | 30000 | 0 | 0.0 | 2 | - | - | - | - | - | - | - | - | - |
| education | category | 30000 | 0 | 0.0 | 7 | - | - | - | - | - | - | - | - | - |
| id | int64 | 30000 | 0 | 0.0 | 30000 | 1.0 | 7500.75 | 15000.5 | 22500.25 | 30000.0 | 15000.5 | 15000.5 | 8660.398374 | 0.0 |
| limit_bal | float64 | 30000 | 0 | 0.0 | 81 | 10000.0 | 50000.0 | 140000.0 | 240000.0 | 1000000.0 | 167484.322667 | 140000.0 | 129747.661567 | 0.992867 |
| marriage | category | 30000 | 0 | 0.0 | 4 | - | - | - | - | - | - | - | - | - |
| pay_1 | category | 30000 | 0 | 0.0 | 11 | - | - | - | - | - | - | - | - | - |
| pay_2 | category | 30000 | 0 | 0.0 | 11 | - | - | - | - | - | - | - | - | - |
| pay_3 | category | 30000 | 0 | 0.0 | 11 | - | - | - | - | - | - | - | - | - |
| pay_4 | category | 30000 | 0 | 0.0 | 11 | - | - | - | - | - | - | - | - | - |
| pay_5 | category | 30000 | 0 | 0.0 | 10 | - | - | - | - | - | - | - | - | - |
| pay_6 | category | 30000 | 0 | 0.0 | 10 | - | - | - | - | - | - | - | - | - |
| pay_amt1 | float64 | 30000 | 0 | 0.0 | 7943 | 0.0 | 1000.0 | 2100.0 | 5006.0 | 873552.0 | 5663.5805 | 2100.0 | 16563.280354 | 14.668364 |
| pay_amt2 | float64 | 30000 | 0 | 0.0 | 7899 | 0.0 | 833.0 | 2009.0 | 5000.0 | 1684259.0 | 5921.1635 | 2009.0 | 23040.870402 | 30.453817 |
| pay_amt3 | float64 | 30000 | 0 | 0.0 | 7518 | 0.0 | 390.0 | 1800.0 | 4505.0 | 896040.0 | 5225.6815 | 1800.0 | 17606.96147 | 17.216635 |
| pay_amt4 | float64 | 30000 | 0 | 0.0 | 6937 | 0.0 | 296.0 | 1500.0 | 4013.25 | 621000.0 | 4826.076867 | 1500.0 | 15666.159744 | 12.904985 |
| pay_amt5 | float64 | 30000 | 0 | 0.0 | 6897 | 0.0 | 252.5 | 1500.0 | 4031.5 | 426529.0 | 4799.387633 | 1500.0 | 15278.305679 | 11.127417 |
| pay_amt6 | float64 | 30000 | 0 | 0.0 | 6939 | 0.0 | 117.75 | 1500.0 | 4000.0 | 528666.0 | 5215.502567 | 1500.0 | 17777.465775 | 10.640727 |
| sex | category | 30000 | 0 | 0.0 | 2 | - | - | - | - | - | - | - | - | - |
df = clean(df, method = 'dropcols', columns = ['id'])
df = clean(df, method = "replaceval",
columns = ["education"],
to_replace = [0,5,6],
value = 5)
df = clean(df, method = "replaceval",
columns = ["marriage"],
to_replace = [0],
value = 3)
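If quickda is unavailable, the same category merges can be sketched with plain pandas `replace` (toy values below, not the full dataset; on an already-categorical column the categories may need to be adjusted first):

```python
import pandas as pd

# Toy education column containing the undocumented codes 0, 5 and 6
education = pd.Series([0, 1, 2, 5, 6, 3])

# Collapse 0, 5 and 6 into the single bucket 5,
# mirroring the quickda "replaceval" call above
merged = education.replace([0, 5, 6], 5)
print(merged.tolist())  # [5, 1, 2, 5, 5, 3]
```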
These values do carry meaning in the dataset with respect to how payments were made, so they cannot be replaced with nulls.
correlation = df.corr()
plt.figure(figsize = (13, 10))
sns.heatmap(correlation, annot=True)
plt.show()
dfSimp = df.copy()  # copy so later edits do not mutate df as well
promedio = (dfSimp["bill_amt1"]+dfSimp["bill_amt2"]+dfSimp["bill_amt3"]
+dfSimp["bill_amt4"]+dfSimp["bill_amt5"]+dfSimp["bill_amt6"])/6
dfSimp["prom_bill_amt"] = promedio
dfSimp = clean(dfSimp, method = 'dropcols', columns = ['bill_amt1',"bill_amt2","bill_amt3","bill_amt4","bill_amt5","bill_amt6"])
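The same row-wise average can be computed in one call with `DataFrame.mean(axis=1)`; a small sketch on toy data:

```python
import pandas as pd

# Toy frame standing in for the six bill_amt columns
bills = pd.DataFrame({f"bill_amt{i}": [i * 10, i * 100] for i in range(1, 7)})

# Row-wise mean across the six columns, equivalent to
# summing them and dividing by 6
bills["prom_bill_amt"] = bills[[f"bill_amt{i}" for i in range(1, 7)]].mean(axis=1)
print(bills["prom_bill_amt"].tolist())  # [35.0, 350.0]
```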
cols = ['limit_bal', 'sex','education','marriage','age','pay_1','pay_2','pay_3',
'pay_4','pay_5','pay_6','prom_bill_amt','pay_amt1','pay_amt2','pay_amt3','pay_amt4','pay_amt5',
'pay_amt6','default.payment.next.month']
dfSimp = dfSimp[cols]
correlation = dfSimp.corr()
plt.figure(figsize = (13, 10))
sns.heatmap(correlation, annot=True)
plt.show()
explore(dfSimp, method="summarize")
| dtypes | count | null_sum | null_pct | nunique | min | 25% | 50% | 75% | max | mean | median | std | skew | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| age | int64 | 30000 | 0 | 0.0 | 56 | 21.0 | 28.0 | 34.0 | 41.0 | 79.0 | 35.4855 | 34.0 | 9.217904 | 0.732246 |
| default.payment.next.month | category | 30000 | 0 | 0.0 | 2 | - | - | - | - | - | - | - | - | - |
| education | category | 30000 | 0 | 0.0 | 5 | - | - | - | - | - | - | - | - | - |
| limit_bal | float64 | 30000 | 0 | 0.0 | 81 | 10000.0 | 50000.0 | 140000.0 | 240000.0 | 1000000.0 | 167484.322667 | 140000.0 | 129747.661567 | 0.992867 |
| marriage | category | 30000 | 0 | 0.0 | 3 | - | - | - | - | - | - | - | - | - |
| pay_1 | category | 30000 | 0 | 0.0 | 11 | - | - | - | - | - | - | - | - | - |
| pay_2 | category | 30000 | 0 | 0.0 | 11 | - | - | - | - | - | - | - | - | - |
| pay_3 | category | 30000 | 0 | 0.0 | 11 | - | - | - | - | - | - | - | - | - |
| pay_4 | category | 30000 | 0 | 0.0 | 11 | - | - | - | - | - | - | - | - | - |
| pay_5 | category | 30000 | 0 | 0.0 | 10 | - | - | - | - | - | - | - | - | - |
| pay_6 | category | 30000 | 0 | 0.0 | 10 | - | - | - | - | - | - | - | - | - |
| pay_amt1 | float64 | 30000 | 0 | 0.0 | 7943 | 0.0 | 1000.0 | 2100.0 | 5006.0 | 873552.0 | 5663.5805 | 2100.0 | 16563.280354 | 14.668364 |
| pay_amt2 | float64 | 30000 | 0 | 0.0 | 7899 | 0.0 | 833.0 | 2009.0 | 5000.0 | 1684259.0 | 5921.1635 | 2009.0 | 23040.870402 | 30.453817 |
| pay_amt3 | float64 | 30000 | 0 | 0.0 | 7518 | 0.0 | 390.0 | 1800.0 | 4505.0 | 896040.0 | 5225.6815 | 1800.0 | 17606.96147 | 17.216635 |
| pay_amt4 | float64 | 30000 | 0 | 0.0 | 6937 | 0.0 | 296.0 | 1500.0 | 4013.25 | 621000.0 | 4826.076867 | 1500.0 | 15666.159744 | 12.904985 |
| pay_amt5 | float64 | 30000 | 0 | 0.0 | 6897 | 0.0 | 252.5 | 1500.0 | 4031.5 | 426529.0 | 4799.387633 | 1500.0 | 15278.305679 | 11.127417 |
| pay_amt6 | float64 | 30000 | 0 | 0.0 | 6939 | 0.0 | 117.75 | 1500.0 | 4000.0 | 528666.0 | 5215.502567 | 1500.0 | 17777.465775 | 10.640727 |
| prom_bill_amt | float64 | 30000 | 0 | 0.0 | 27370 | -56043.166667 | 4781.333333 | 21051.833333 | 57104.416667 | 877313.833333 | 44976.9452 | 21051.833333 | 63260.72186 | 2.734744 |
| sex | category | 30000 | 0 | 0.0 | 2 | - | - | - | - | - | - | - | - | - |
# Plot histograms to visualize each numeric column of the dataset
dfSimp.hist(figsize=(30, 25))
array([[<AxesSubplot:title={'center':'limit_bal'}>,
<AxesSubplot:title={'center':'age'}>,
<AxesSubplot:title={'center':'prom_bill_amt'}>],
[<AxesSubplot:title={'center':'pay_amt1'}>,
<AxesSubplot:title={'center':'pay_amt2'}>,
<AxesSubplot:title={'center':'pay_amt3'}>],
[<AxesSubplot:title={'center':'pay_amt4'}>,
<AxesSubplot:title={'center':'pay_amt5'}>,
<AxesSubplot:title={'center':'pay_amt6'}>]], dtype=object)
#dfSimp = clean(dfSimp, method='outliers', columns=["pay_amt1","pay_amt3","pay_amt5"])
#dfSimp.index = list(range(0,23241))
explore(dfSimp, method="summarize")
| dtypes | count | null_sum | null_pct | nunique | min | 25% | 50% | 75% | max | mean | median | std | skew | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| age | int64 | 30000 | 0 | 0.0 | 56 | 21.0 | 28.0 | 34.0 | 41.0 | 79.0 | 35.4855 | 34.0 | 9.217904 | 0.732246 |
| default.payment.next.month | category | 30000 | 0 | 0.0 | 2 | - | - | - | - | - | - | - | - | - |
| education | category | 30000 | 0 | 0.0 | 5 | - | - | - | - | - | - | - | - | - |
| limit_bal | float64 | 30000 | 0 | 0.0 | 81 | 10000.0 | 50000.0 | 140000.0 | 240000.0 | 1000000.0 | 167484.322667 | 140000.0 | 129747.661567 | 0.992867 |
| marriage | category | 30000 | 0 | 0.0 | 3 | - | - | - | - | - | - | - | - | - |
| pay_1 | category | 30000 | 0 | 0.0 | 11 | - | - | - | - | - | - | - | - | - |
| pay_2 | category | 30000 | 0 | 0.0 | 11 | - | - | - | - | - | - | - | - | - |
| pay_3 | category | 30000 | 0 | 0.0 | 11 | - | - | - | - | - | - | - | - | - |
| pay_4 | category | 30000 | 0 | 0.0 | 11 | - | - | - | - | - | - | - | - | - |
| pay_5 | category | 30000 | 0 | 0.0 | 10 | - | - | - | - | - | - | - | - | - |
| pay_6 | category | 30000 | 0 | 0.0 | 10 | - | - | - | - | - | - | - | - | - |
| pay_amt1 | float64 | 30000 | 0 | 0.0 | 7943 | 0.0 | 1000.0 | 2100.0 | 5006.0 | 873552.0 | 5663.5805 | 2100.0 | 16563.280354 | 14.668364 |
| pay_amt2 | float64 | 30000 | 0 | 0.0 | 7899 | 0.0 | 833.0 | 2009.0 | 5000.0 | 1684259.0 | 5921.1635 | 2009.0 | 23040.870402 | 30.453817 |
| pay_amt3 | float64 | 30000 | 0 | 0.0 | 7518 | 0.0 | 390.0 | 1800.0 | 4505.0 | 896040.0 | 5225.6815 | 1800.0 | 17606.96147 | 17.216635 |
| pay_amt4 | float64 | 30000 | 0 | 0.0 | 6937 | 0.0 | 296.0 | 1500.0 | 4013.25 | 621000.0 | 4826.076867 | 1500.0 | 15666.159744 | 12.904985 |
| pay_amt5 | float64 | 30000 | 0 | 0.0 | 6897 | 0.0 | 252.5 | 1500.0 | 4031.5 | 426529.0 | 4799.387633 | 1500.0 | 15278.305679 | 11.127417 |
| pay_amt6 | float64 | 30000 | 0 | 0.0 | 6939 | 0.0 | 117.75 | 1500.0 | 4000.0 | 528666.0 | 5215.502567 | 1500.0 | 17777.465775 | 10.640727 |
| prom_bill_amt | float64 | 30000 | 0 | 0.0 | 27370 | -56043.166667 | 4781.333333 | 21051.833333 | 57104.416667 | 877313.833333 | 44976.9452 | 21051.833333 | 63260.72186 | 2.734744 |
| sex | category | 30000 | 0 | 0.0 | 2 | - | - | - | - | - | - | - | - | - |
After removing some outliers we are left with 23,241 observations, around 77% of the original dataset, which still seems like a good number of records to work with.
# Plot histograms to visualize each numeric column of the dataset
dfSimp.hist(figsize=(30, 25))
array([[<AxesSubplot:title={'center':'limit_bal'}>,
<AxesSubplot:title={'center':'age'}>,
<AxesSubplot:title={'center':'prom_bill_amt'}>],
[<AxesSubplot:title={'center':'pay_amt1'}>,
<AxesSubplot:title={'center':'pay_amt2'}>,
<AxesSubplot:title={'center':'pay_amt3'}>],
[<AxesSubplot:title={'center':'pay_amt4'}>,
<AxesSubplot:title={'center':'pay_amt5'}>,
<AxesSubplot:title={'center':'pay_amt6'}>]], dtype=object)
dfSimp['education'].value_counts().plot(kind='bar', title = "Education")
<AxesSubplot:title={'center':'Education'}>
dfSimp['marriage'].value_counts().plot(kind='bar', title = "Marital status")
<AxesSubplot:title={'center':'Marital status'}>
dfSimp['sex'].value_counts().plot(kind='bar', title = "Sex")
<AxesSubplot:title={'center':'Sex'}>
dfSimp['default.payment.next.month'].value_counts().plot(kind='bar', title = "Default")
<AxesSubplot:title={'center':'Default'}>
pd.crosstab(dfSimp.education, dfSimp['default.payment.next.month']).plot(kind="bar",figsize=(15,6), title = "Payment analysis by education")
plt.xlabel('Education')
plt.xticks(rotation=0)
plt.legend(["Paid", "Default"])
plt.show()
pd.crosstab(dfSimp.age, dfSimp['default.payment.next.month']).plot(kind="bar",figsize=(15,6), title = "Payment analysis by age")
plt.xlabel('Age')
plt.xticks(rotation=0)
plt.legend(["Paid", "Default"])
plt.show()
pd.crosstab(dfSimp.sex, dfSimp['default.payment.next.month']).plot(kind="bar",figsize=(15,6), title = "Payment analysis by sex")
plt.xlabel('Sex')
plt.xticks(rotation=0)
plt.legend(["Paid", "Default"])
plt.show()
pd.crosstab(dfSimp.marriage, dfSimp['default.payment.next.month']).plot(kind="bar",figsize=(15,6),title = "Payment analysis by marital status")
plt.xlabel('Marital status')
plt.xticks(rotation=0)
plt.legend(["Paid", "Default"])
plt.show()
sns.catplot(x="default.payment.next.month", y="limit_bal", data=dfSimp)
<seaborn.axisgrid.FacetGrid at 0x20a063bdd30>
sns.catplot(x="education", y="limit_bal", data=dfSimp)
<seaborn.axisgrid.FacetGrid at 0x20a0838fee0>
sns.catplot(x="default.payment.next.month", y="prom_bill_amt", data=dfSimp)
<seaborn.axisgrid.FacetGrid at 0x20a05ca71f0>
After the cleaning operations, the dataset has 23,241 observations and 19 variables. Of these variables:
Analyzing the qualitative variables, we can note some important characteristics, such as:
From the quantitative variables we can observe the following:
Crossing variables, we can observe the following:
# One-hot encode each categorical column, dropping the first level;
# note the pay_* columns share category values, so several of their
# dummy columns come out with identical names
to_encode = ["sex", "marriage", "education",
             "pay_1", "pay_2", "pay_3", "pay_4", "pay_5", "pay_6"]
for col in to_encode:
    dfSimp = pd.concat([dfSimp.drop(columns=[col]),
                        pd.get_dummies(dfSimp[col], drop_first=True)], axis=1)
dfSimp
| limit_bal | age | prom_bill_amt | pay_amt1 | pay_amt2 | pay_amt3 | pay_amt4 | pay_amt5 | pay_amt6 | default.payment.next.month | ... | 8 | -1 | 0 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 20000.0 | 24 | 1284.000000 | 0.0 | 689.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 120000.0 | 26 | 2846.166667 | 0.0 | 1000.0 | 1000.0 | 1000.0 | 0.0 | 2000.0 | 1 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 90000.0 | 34 | 16942.166667 | 1518.0 | 1500.0 | 1000.0 | 1000.0 | 1000.0 | 5000.0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 50000.0 | 37 | 38555.666667 | 2000.0 | 2019.0 | 1200.0 | 1100.0 | 1069.0 | 1000.0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 50000.0 | 57 | 18223.166667 | 2000.0 | 36681.0 | 10000.0 | 9000.0 | 689.0 | 679.0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 29995 | 220000.0 | 39 | 120891.500000 | 8500.0 | 20000.0 | 5003.0 | 3047.0 | 5000.0 | 1000.0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 29996 | 150000.0 | 43 | 3530.333333 | 1837.0 | 3526.0 | 8998.0 | 129.0 | 0.0 | 0.0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 29997 | 30000.0 | 37 | 11749.333333 | 0.0 | 0.0 | 22000.0 | 4200.0 | 2000.0 | 3100.0 | 1 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 29998 | 80000.0 | 41 | 44435.166667 | 85900.0 | 3409.0 | 1178.0 | 1926.0 | 52964.0 | 1804.0 | 1 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 29999 | 50000.0 | 46 | 38479.000000 | 2078.0 | 1800.0 | 1430.0 | 1000.0 | 1000.0 | 1000.0 | 1 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
30000 rows × 75 columns
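One caveat of encoding each pay_* column separately is that they share category values, so several dummy columns end up with identical names (visible above as the repeated -1, 0, 2, ... headers). Passing the columns to `pd.get_dummies` directly prefixes each dummy with its source column name; a minimal sketch on toy data:

```python
import pandas as pd

toy = pd.DataFrame({"pay_1": [0, 2, -1], "pay_2": [0, 0, 2]})

# get_dummies with columns= prefixes each dummy with its source column,
# so the pay_1 and pay_2 dummies no longer collide
encoded = pd.get_dummies(toy, columns=["pay_1", "pay_2"], drop_first=True)
print(sorted(encoded.columns))  # ['pay_1_0', 'pay_1_2', 'pay_2_2']
```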
from sklearn.model_selection import train_test_split
X = dfSimp.drop(columns='default.payment.next.month')
y = dfSimp['default.payment.next.month']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=101)
X_train
| limit_bal | age | prom_bill_amt | pay_amt1 | pay_amt2 | pay_amt3 | pay_amt4 | pay_amt5 | pay_amt6 | 2 | ... | 8 | -1 | 0 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 20551 | 50000.0 | 28 | 41042.000000 | 2000.0 | 2100.0 | 1300.0 | 650.0 | 600.0 | 1652.0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 17209 | 90000.0 | 28 | 3195.833333 | 1950.0 | 7956.0 | 499.0 | 0.0 | 5990.0 | 0.0 | 1 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2134 | 160000.0 | 66 | 71245.500000 | 3400.0 | 3000.0 | 2600.0 | 2600.0 | 2700.0 | 2800.0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2958 | 20000.0 | 36 | -25.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 716 | 160000.0 | 27 | 83043.833333 | 4000.0 | 4000.0 | 3500.0 | 4000.0 | 4000.0 | 2500.0 | 1 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5695 | 20000.0 | 28 | 18199.166667 | 1600.0 | 1700.0 | 1000.0 | 1000.0 | 556.0 | 193.0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 8006 | 80000.0 | 27 | 54338.166667 | 2896.0 | 1722.0 | 1619.0 | 1665.0 | 1700.0 | 1833.0 | 1 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 17745 | 340000.0 | 56 | 6241.166667 | 5399.0 | 12353.0 | 8619.0 | 0.0 | 0.0 | 0.0 | 1 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 17931 | 230000.0 | 32 | 238198.000000 | 10183.0 | 10057.0 | 19000.0 | 9100.0 | 0.0 | 16500.0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 13151 | 180000.0 | 34 | 146319.500000 | 6500.0 | 5600.0 | 0.0 | 12000.0 | 0.0 | 11800.0 | 1 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
24000 rows × 74 columns
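Since roughly 22% of clients default, a stratified split keeps that class ratio in both partitions; the split above is unstratified. A sketch with toy labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 80/20 imbalanced toy labels, mimicking the defaulter proportion
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([0] * 80 + [1] * 20)

# stratify=y keeps the 80/20 class ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=101, stratify=y_toy)
print(len(y_te), int(y_te.sum()))  # 20 test samples, 4 of them positive
```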
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
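Using `fit_transform` on the training split but only `transform` on the test split means the mean and standard deviation come from the training data alone, avoiding test-set leakage. A toy illustration:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_tr = np.array([[1.0], [3.0]])   # toy feature: mean 2, std 1
X_te = np.array([[2.0], [4.0]])

scaler = StandardScaler().fit(X_tr)    # statistics from the training split only
print(scaler.transform(X_te).ravel())  # (x - 2) / 1 -> [0. 2.]
```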
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import mean_squared_error
from sklearn.metrics import roc_curve
from sklearn.metrics import plot_roc_curve
from sklearn.metrics import precision_recall_curve
from matplotlib import pyplot
accuracy = []
recall = []
f1 = []
roc = []
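The same four metrics are appended after each model below; a small helper (hypothetical, not part of the original notebook) condenses that bookkeeping:

```python
from sklearn.metrics import accuracy_score, recall_score, f1_score, roc_auc_score

def evaluate(y_true, y_pred):
    """Return the four tracked metrics for one model's predictions."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        "roc_auc": roc_auc_score(y_true, y_pred),
    }

# Usage after each fit: scores = evaluate(y_test, y_pred)
```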
from sklearn import svm
svmc = svm.SVC(random_state=51, probability=True)
svmc.fit(X_train, y_train)
y_pred = svmc.predict(X_test)
print(classification_report(y_test, y_pred))
precision recall f1-score support
0 0.84 0.96 0.90 4690
1 0.71 0.36 0.47 1310
accuracy 0.83 6000
macro avg 0.77 0.66 0.68 6000
weighted avg 0.81 0.83 0.80 6000
accuracy.append(accuracy_score(y_test, y_pred))
recall.append(recall_score(y_test, y_pred))
f1.append(f1_score(y_test, y_pred))
roc.append(roc_auc_score(y_test, y_pred))
cnf_matrix = confusion_matrix(y_test, y_pred)
class_names=[0,1]
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
# create heatmap
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="YlGnBu" ,fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
Text(0.5, 257.44, 'Predicted label')
svm_disp = plot_roc_curve(svmc, X_test, y_test)
plt.plot([0, 1], [0, 1], color="red", lw=2, linestyle="--")
plt.show()
m_probs = svmc.predict_proba(X_test)
m_precision, m_recall, _ = precision_recall_curve(y_test, m_probs[:, 1])
no_skill = len(y_test[y_test==1]) / len(y_test)
pyplot.plot([0, 1], [no_skill, no_skill], linestyle='--', label='No Skill')
pyplot.plot(m_recall, m_precision, marker='.', label='SVM')
# axis labels
pyplot.xlabel('Recall')
pyplot.ylabel('Precision')
# show the legend
pyplot.legend()
# show the plot
pyplot.show()
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(random_state=51, max_iter=1000)  # raised max_iter so lbfgs converges
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)
print(classification_report(y_test, y_pred))
precision recall f1-score support
0 0.84 0.96 0.90 4690
1 0.70 0.34 0.46 1310
accuracy 0.82 6000
macro avg 0.77 0.65 0.68 6000
weighted avg 0.81 0.82 0.80 6000
accuracy.append(accuracy_score(y_test, y_pred))
recall.append(recall_score(y_test, y_pred))
f1.append(f1_score(y_test, y_pred))
roc.append(roc_auc_score(y_test, y_pred))
cnf_matrix = confusion_matrix(y_test, y_pred)
class_names=[0,1]
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
# create heatmap
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="YlGnBu" ,fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
y_pred_proba = lr.predict_proba(X_test)[::,1]
fpr, tpr, _ = roc_curve(y_test, y_pred_proba)
auc = roc_auc_score(y_test, y_pred_proba)
plt.plot(fpr,tpr,label="data 1, auc="+str(auc))
plt.legend(loc=4)
plt.show()
m_probs = lr.predict_proba(X_test)
m_precision, m_recall, _ = precision_recall_curve(y_test, m_probs[:, 1])
no_skill = len(y_test[y_test==1]) / len(y_test)
pyplot.plot([0, 1], [no_skill, no_skill], linestyle='--', label='No Skill')
pyplot.plot(m_recall, m_precision, marker='.', label='Logistic')
# axis labels
pyplot.xlabel('Recall')
pyplot.ylabel('Precision')
# show the legend
pyplot.legend()
# show the plot
pyplot.show()
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(random_state=51)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       0.84      0.95      0.89      4690
           1       0.67      0.35      0.46      1310

    accuracy                           0.82      6000
   macro avg       0.75      0.65      0.68      6000
weighted avg       0.80      0.82      0.80      6000
accuracy.append(accuracy_score(y_test, y_pred))
recall.append(recall_score(y_test, y_pred))
f1.append(f1_score(y_test, y_pred))
roc.append(roc_auc_score(y_test, y_pred))
cnf_matrix = confusion_matrix(y_test, y_pred)
class_names=[0,1]
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
# create heatmap
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="YlGnBu" ,fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
rf_disp = plot_roc_curve(rf, X_test, y_test)
plt.plot([0, 1], [0, 1], color="red", lw=2, linestyle="--")
plt.show()
m_probs = rf.predict_proba(X_test)
m_precision, m_recall, _ = precision_recall_curve(y_test, m_probs[:, 1])
no_skill = len(y_test[y_test==1]) / len(y_test)
pyplot.plot([0, 1], [no_skill, no_skill], linestyle='--', label='No Skill')
pyplot.plot(m_recall, m_precision, marker='.', label='Random Forest')
# axis labels
pyplot.xlabel('Recall')
pyplot.ylabel('Precision')
# show the legend
pyplot.legend()
# show the plot
pyplot.show()
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb.fit(X_train, y_train)
y_pred = gnb.predict(X_test)
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       0.80      0.98      0.88      4690
           1       0.61      0.11      0.18      1310

    accuracy                           0.79      6000
   macro avg       0.70      0.54      0.53      6000
weighted avg       0.76      0.79      0.73      6000
accuracy.append(accuracy_score(y_test, y_pred))
recall.append(recall_score(y_test, y_pred))
f1.append(f1_score(y_test, y_pred))
roc.append(roc_auc_score(y_test, y_pred))
cnf_matrix = confusion_matrix(y_test, y_pred)
class_names=[0,1]
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
# create heatmap
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="YlGnBu" ,fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
gnb_disp = plot_roc_curve(gnb, X_test, y_test)
plt.plot([0, 1], [0, 1], color="red", lw=2, linestyle="--")
plt.show()
m_probs = gnb.predict_proba(X_test)
m_precision, m_recall, _ = precision_recall_curve(y_test, m_probs[:, 1])
no_skill = len(y_test[y_test==1]) / len(y_test)
pyplot.plot([0, 1], [no_skill, no_skill], linestyle='--', label='No Skill')
pyplot.plot(m_recall, m_precision, marker='.', label='Naive Bayes')
# axis labels
pyplot.xlabel('Recall')
pyplot.ylabel('Precision')
# show the legend
pyplot.legend()
# show the plot
pyplot.show()
import xgboost as xgb
param_grid = {
"max_depth": [3, 4, 5, 7],
"learning_rate": [0.1, 0.01, 0.05],
"gamma": [0, 0.25, 1],
"subsample": [0.5, 0.6, 0.7, 0.8, 0.9, 0.1],
"colsample_bytree": [0.5, 0.6, 0.7, 0.8, 0.9, 0.1],
}
from sklearn.model_selection import GridSearchCV
xgb_cl = xgb.XGBClassifier(objective="binary:logistic")
grid_cv = GridSearchCV(xgb_cl, param_grid, n_jobs=-1, cv=3, scoring="roc_auc")
_ = grid_cv.fit(X_train, y_train)
C:\Users\jdieg\AppData\Local\Programs\Python\Python39\lib\site-packages\xgboost\sklearn.py:1224: UserWarning: The use of label encoder in XGBClassifier is deprecated and will be removed in a future release. To remove this warning, do the following: 1) Pass option use_label_encoder=False when constructing XGBClassifier object; and 2) Encode your labels (y) as integers starting with 0, i.e. 0, 1, 2, ..., [num_class - 1].
[06:32:35] WARNING: C:/Users/Administrator/workspace/xgboost-win64_release_1.5.0/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
grid_cv.best_score_
0.7821880360671639
grid_cv.best_params_
{'colsample_bytree': 0.6,
'gamma': 1,
'learning_rate': 0.05,
'max_depth': 7,
'subsample': 0.9}
Optimized values were found for the parameters. However, max_depth and gamma landed on the highest value of the range we supplied, so we should try larger values.
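That boundary check can be automated. A sketch with a small hypothetical helper that flags any `best_params_` entry sitting at an endpoint of its searched range (the values below are the ones from the first grid search):

```python
def at_grid_edge(best_params, grid):
    """Return, sorted, the parameter names whose best value is a range endpoint."""
    return sorted(k for k, v in best_params.items()
                  if v in (min(grid[k]), max(grid[k])))

# Ranges and winners from the first GridSearchCV run above
grid = {"max_depth": [3, 4, 5, 7], "gamma": [0, 0.25, 1]}
best = {"max_depth": 7, "gamma": 1}
print(at_grid_edge(best, grid))  # ['gamma', 'max_depth']
```

Any name this helper prints is a cue to extend that parameter's range and search again, which is exactly what the cells below do for gamma and max_depth.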
# Fixed values
param_grid["learning_rate"] = [0.05]
param_grid["subsample"] = [0.9]
param_grid["colsample_bytree"] = [0.6]
# New values
param_grid["gamma"] = [3, 5, 7]
param_grid["max_depth"] = [9, 15, 20]
grid_cv_2 = GridSearchCV(xgb_cl, param_grid, cv=3, scoring="roc_auc", n_jobs=-1)
_ = grid_cv_2.fit(X_train, y_train)
grid_cv_2.best_score_
0.7822423045474644
grid_cv_2.best_params_
{'colsample_bytree': 0.6,
'gamma': 7,
'learning_rate': 0.05,
'max_depth': 9,
'subsample': 0.9}
xgbf = xgb.XGBClassifier(
colsample_bytree= 0.6,
gamma= 7,
learning_rate= 0.05,
max_depth= 9,
subsample= 0.9,
objective="binary:logistic"
)
xgbf.fit(X_train, y_train)
y_pred = xgbf.predict(X_test)
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       0.84      0.96      0.90      4690
           1       0.70      0.34      0.46      1310

    accuracy                           0.82      6000
   macro avg       0.77      0.65      0.68      6000
weighted avg       0.81      0.82      0.80      6000
accuracy.append(accuracy_score(y_test, y_pred))
recall.append(recall_score(y_test, y_pred))
f1.append(f1_score(y_test, y_pred))
roc.append(roc_auc_score(y_test, y_pred))
cnf_matrix = confusion_matrix(y_test, y_pred)
class_names=[0,1]
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
# create heatmap
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="YlGnBu" ,fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
xgbf_disp = plot_roc_curve(xgbf, X_test, y_test)
plt.plot([0, 1], [0, 1], color="red", lw=2, linestyle="--")
plt.show()
m_probs = xgbf.predict_proba(X_test)
m_precision, m_recall, _ = precision_recall_curve(y_test, m_probs[:, 1])
no_skill = len(y_test[y_test==1]) / len(y_test)
pyplot.plot([0, 1], [no_skill, no_skill], linestyle='--', label='No Skill')
pyplot.plot(m_recall, m_precision, marker='.', label='XGBoost')
# axis labels
pyplot.xlabel('Recall')
pyplot.ylabel('Precision')
# show the legend
pyplot.legend()
# show the plot
pyplot.show()
modelos = ["SVM", "Reg. Logistica", "Random Forest", "Naive Bayes", "XGBoost"]
fig = plt.figure()
ax = fig.add_axes([0,0,1,1])
ax.set_title("Comparacion accuracy")
ax.set_ylabel('Accuracy %')
bars = plt.bar(modelos, height=accuracy)
for bar in bars:
yval = round(bar.get_height(), 2)
print(yval)
plt.text(bar.get_x() + 0.3, yval + .01, yval)
plt.show()
0.83 0.82 0.82 0.79 0.82
fig = plt.figure()
ax = fig.add_axes([0,0,1,1])
ax.set_title("Comparacion recall")
ax.set_ylabel('Recall %')
bars = plt.bar(modelos, height=recall)
for bar in bars:
yval = round(bar.get_height(), 2)
print(yval)
plt.text(bar.get_x() + 0.3, yval + .005, yval)
plt.show()
0.36 0.34 0.35 0.11 0.34
fig = plt.figure()
ax = fig.add_axes([0,0,1,1])
ax.set_title("Comparacion f1")
ax.set_ylabel('f1 %')
bars = plt.bar(modelos, height=f1)
for bar in bars:
yval = round(bar.get_height(), 2)
print(yval)
plt.text(bar.get_x() + 0.3, yval + .005, yval)
plt.show()
0.47 0.46 0.46 0.18 0.46
fig = plt.figure()
ax = fig.add_axes([0,0,1,1])
ax.set_title("Comparacion roc_auc")
ax.set_ylabel('roc_auc %')
bars = plt.bar(modelos, height=roc)
for bar in bars:
yval = round(bar.get_height(), 2)
print(yval)
plt.text(bar.get_x() + 0.3, yval + .01, yval)
plt.show()
0.66 0.65 0.65 0.54 0.65
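The four bar charts can also be condensed into a single table, which makes the per-model trade-offs easier to scan. A sketch using the rounded values printed above (the full notebook would pass its `accuracy`, `recall`, `f1`, and `roc` lists directly):

```python
import pandas as pd

modelos = ["SVM", "Logistic Reg.", "Random Forest", "Naive Bayes", "XGBoost"]
# Rounded metric values as printed by the bar-chart cells above
accuracy = [0.83, 0.82, 0.82, 0.79, 0.82]
recall   = [0.36, 0.34, 0.35, 0.11, 0.34]
f1       = [0.47, 0.46, 0.46, 0.18, 0.46]
roc      = [0.66, 0.65, 0.65, 0.54, 0.65]

summary = pd.DataFrame(
    {"accuracy": accuracy, "recall": recall, "f1": f1, "roc_auc": roc},
    index=modelos,
)
# Sort by f1, the metric that balances precision and recall on class 1
print(summary.sort_values("f1", ascending=False))
```

The table makes the headline result explicit: SVM edges out Logistic Regression, Random Forest, and XGBoost by a small margin on every metric, while Naive Bayes trails badly on recall and F1 for the defaulting class.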